• Unlike a Turing machine (TM), an NTM is completely differentiable

- Long Short-Term Memory (LSTM) RNNs are designed to handle vanishing and exploding gradients


- They natively handle variable-length structures
• 3. Addressing


- 1. Focusing by Content
* Each head produces a key vector k_t of length M


* It generates a content-based weighting w_t^c from a similarity measure, using the ‘key strength’ β_t
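
As a minimal sketch of content focusing, assuming cosine similarity as the similarity measure and a softmax over the β-scaled similarities, the weighting could be computed as follows. The function names and the toy memory are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors; epsilon guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def content_weighting(memory, key, beta):
    """Softmax over beta-scaled similarities between the key and each memory row."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy memory: N = 3 locations, each a vector of length M = 2.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key = [1.0, 0.1]

soft = content_weighting(memory, key, beta=1.0)    # diffuse focus
sharp = content_weighting(memory, key, beta=50.0)  # nearly one-hot on row 0
```

A larger key strength concentrates the weighting on the best-matching row, while a small one spreads attention across several rows.
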


- 2. Interpolation
* Each head emits a scalar interpolation gate g_t
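
Written out, with w^c_t denoting the content weighting from the focusing step and w_{t-1} the weighting from the previous time step (notation assumed here for illustration), the gate blends the two:

```latex
\mathbf{w}^g_t = g_t\,\mathbf{w}^c_t + (1 - g_t)\,\mathbf{w}_{t-1}, \qquad g_t \in (0, 1)
```

As g_t approaches 0 the content weighting is ignored and the previous weighting is carried forward; as g_t approaches 1 the previous weighting is ignored.
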


- 3. Convolutional shift


* Each head emits a distribution s_t over the allowable integer shifts


$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j)\, s_t(i - j)$$

(all index arithmetic is modulo N)
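
The shift is a circular convolution of the gated weighting with the shift distribution. A small sketch, assuming the shift distribution is stored as a length-N list indexed modulo N (index 0 is no shift, index 1 is a shift of +1, index N-1 is a shift of -1); the function name is hypothetical:

```python
def circular_shift(weights, shift):
    """Circular convolution: out[i] = sum_j weights[j] * shift[(i - j) mod N]."""
    n = len(weights)
    return [
        sum(weights[j] * shift[(i - j) % n] for j in range(n))
        for i in range(n)
    ]

focused = [0.0, 1.0, 0.0, 0.0, 0.0]      # all attention on location 1
shift_next = [0.0, 1.0, 0.0, 0.0, 0.0]   # all mass on a shift of +1
shift_soft = [0.8, 0.1, 0.0, 0.0, 0.1]   # mostly stay, some leakage to -1 and +1

moved = circular_shift(focused, shift_next)    # focus moves to location 2
blurred = circular_shift(focused, shift_soft)  # focus spreads to both neighbours
wrapped = circular_shift([0.0, 0.0, 0.0, 0.0, 1.0], shift_next)  # wraps to location 0
```

A shift distribution that is not one-hot blurs the weighting, which is what the sharpening step is meant to counteract.
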


Neural Turing Machines 


• 3. Addressing
- 4. Sharpening


* Each head emits a scalar sharpening parameter γ_t ≥ 1
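
In equation form, writing the shifted weighting as a tilde-w (notation assumed for illustration), sharpening raises each weight to the power γ_t and renormalises:

```latex
w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_{j} \tilde{w}_t(j)^{\gamma_t}}, \qquad \gamma_t \ge 1
```

For example, a blurred weighting (0.1, 0.8, 0.1) with γ_t = 3 becomes approximately (0.002, 0.996, 0.002), since 0.8^3 = 0.512 and 0.1^3 = 0.001.
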
• 3. Addressing (putting it all together)


- The combined addressing system can operate in three complementary modes
* A weighting can be chosen by the content system without any modification by the location system
* A weighting produced by the content addressing system can be chosen and then shifted
* A weighting from the previous time step can be rotated without any input from the content-based addressing system
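
The three modes can be illustrated by chaining the four addressing steps. This is a sketch under stated assumptions rather than a reference implementation: cosine similarity with a softmax for the content step, a length-N shift list indexed modulo N, and a hypothetical `address` function with toy inputs.

```python
import math

def address(memory, w_prev, key, beta, gate, shift, gamma):
    """One addressing step: content focus, interpolation, shift, sharpening."""
    n = len(memory)

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / (norms + 1e-8)

    # 1. Content focus: softmax over beta-scaled similarities.
    exps = [math.exp(beta * cosine(key, row)) for row in memory]
    total = sum(exps)
    w_c = [e / total for e in exps]
    # 2. Interpolation between content weighting and previous weighting.
    w_g = [gate * c + (1.0 - gate) * p for c, p in zip(w_c, w_prev)]
    # 3. Circular convolution with the shift distribution.
    w_s = [sum(w_g[j] * shift[(i - j) % n] for j in range(n)) for i in range(n)]
    # 4. Sharpening.
    powered = [w ** gamma for w in w_s]
    norm = sum(powered)
    return [p / norm for p in powered]

memory = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
key = [0, 1, 0, 0]              # matches location 1
w_prev = [0.0, 0.0, 0.0, 1.0]   # previous focus on location 3
STAY, NEXT = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]

content_only = address(memory, w_prev, key, 20.0, 1.0, STAY, 1.0)     # mode 1
content_shifted = address(memory, w_prev, key, 20.0, 1.0, NEXT, 1.0)  # mode 2
rotate_previous = address(memory, w_prev, key, 20.0, 0.0, NEXT, 1.0)  # mode 3
```

With the gate at 1 the previous weighting plays no role; with the gate at 0 the key is ignored and the previous focus simply rotates, here wrapping from location 3 to location 0.
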
• Controller Network Architecture
- Feed-forward vs. recurrent


- An LSTM (recurrent) controller has its own internal memory, complementary to the memory matrix M


- Its hidden LSTM layers are ‘like’ registers in a processor

- This allows information to be mixed across multiple time steps

- A feed-forward controller offers better transparency
• Test the NTM's ability to learn simple algorithms such as copying and sorting

• Demonstrate that the learned solutions generalize well beyond the range of the training data

• Three architectures are tested
- NTM with feed-forward controller
- NTM with LSTM controller
- Standard LSTM network
• 1. Copy


- Tests whether the NTM can store and retrieve data
- Trained to copy sequences of 8-bit vectors


- Sequence lengths vary between 1 and 20 vectors
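
A sketch of how one copy-task example might be generated. The 8-bit width and the 1 to 20 length range come from the slide; the extra delimiter channel, the blank inputs during recall, and the function name are assumptions made for illustration.

```python
import random

def make_copy_example(rng, width=8, min_len=1, max_len=20):
    """Return (inputs, targets) for one copy episode."""
    length = rng.randint(min_len, max_len)
    sequence = [[rng.randint(0, 1) for _ in range(width)] for _ in range(length)]
    # Assumed layout: one extra input channel flags the end of the sequence.
    inputs = [vec + [0] for vec in sequence]
    inputs.append([0] * width + [1])  # delimiter step
    # Assumed: blank inputs while the network reproduces the sequence.
    inputs += [[0] * (width + 1) for _ in range(length)]
    return inputs, sequence

rng = random.Random(0)
inputs, targets = make_copy_example(rng)
```

The network only sees the sequence once, so reproducing it after the delimiter requires storing it somewhere, either in memory or in the controller state.
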
Experiments


• 2. Repeat Copy


- Tests whether the NTM can learn a simple nested function


- Extends the copy task by copying the input a specified number of times


- Each training example is a random-length sequence of 8-bit binary vectors plus a scalar giving the number of copies


- The scalar is chosen at random between 1 and 10
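
The nested structure of the task (an outer loop over repeats, an inner loop over the sequence) shows up directly in how a target could be built. In this sketch the maximum sequence length of 10 and the function name are assumptions; the slide specifies only the 8-bit width and the 1 to 10 repeat range.

```python
import random

def make_repeat_copy_example(rng, width=8, max_len=10, max_repeats=10):
    """Return (sequence, repeats, target) for one repeat-copy episode."""
    length = rng.randint(1, max_len)       # assumed length range
    repeats = rng.randint(1, max_repeats)  # scalar number of copies
    sequence = [[rng.randint(0, 1) for _ in range(width)] for _ in range(length)]
    # Outer loop over repeats, inner loop over the stored sequence.
    target = [list(vec) for _ in range(repeats) for vec in sequence]
    return sequence, repeats, target

sequence, repeats, target = make_repeat_copy_example(random.Random(1))
```
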
Figure 7: Repeat Copy Learning Curves.
• 3. Associative Recall
- Tests the NTM's ability to handle data items that reference one another


- The training input is a list of items followed by a query item


- The target output is the item that follows the query item in the list
- Each item is a sequence of three 6-bit binary vectors


- Each ‘episode’ contains between two and six items
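
One way an episode could be generated is sketched below. The item shape and the two-to-six item range follow the slide; the rule for choosing the query (any item except the last, so that a successor always exists) and the function name are assumptions.

```python
import random

def make_recall_example(rng, bits=6, vectors_per_item=3, min_items=2, max_items=6):
    """Return (items, query_index, query, target) for one episode."""
    n_items = rng.randint(min_items, max_items)
    items = [
        [[rng.randint(0, 1) for _ in range(bits)] for _ in range(vectors_per_item)]
        for _ in range(n_items)
    ]
    # Assumed: query any item except the last, so a following item exists.
    query_index = rng.randrange(n_items - 1)
    return items, query_index, items[query_index], items[query_index + 1]

items, query_index, query, target = make_recall_example(random.Random(2))
```
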


Figure 10: Associative Recall Learning Curves for NTM and LSTM. (Plot of cost per sequence in bits against sequence number in thousands, up to 1000, for LSTM, NTM with LSTM Controller, and NTM with Feedforward Controller.)
Figure 11: Generalisation Performance on Associative Recall for Longer Item Sequences.
The NTM with either a feedforward or LSTM controller generalises to much longer sequences 
of items than the LSTM alone. In particular, the NTM with a feedforward controller is nearly 
perfect for item sequences of twice the length of sequences in its training set. 


[Figure: Associative Recall. Panels: Write Weightings and Read Weightings, each plotted against Time.]




• 4. Dynamic N-Grams


- Tests whether the NTM can rapidly adapt to new predictive distributions


- Trained on 200-bit sequences generated from 6-gram binary distributions


- Asks whether the NTM can learn the optimal estimator


$$P(B = 1 \mid N_1, N_0, c) = \frac{N_1 + \frac{1}{2}}{N_1 + N_0 + 1}$$
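
Here c is the context formed by the previous five bits, and N_0 and N_1 count the zeros and ones observed so far after that context. The sketch below applies the estimator to a sampled sequence. Drawing each context's probability from a Beta(1/2, 1/2) prior and seeding the sequence with uniform random bits are assumptions, as are all function names.

```python
import itertools
import random

ORDER = 6  # 6-gram model: each bit depends on the previous 5 bits

def random_table(rng):
    """Assumed prior: P(next bit = 1) per 5-bit context drawn from Beta(1/2, 1/2)."""
    contexts = itertools.product((0, 1), repeat=ORDER - 1)
    return {c: rng.betavariate(0.5, 0.5) for c in contexts}

def sample_bits(rng, table, length=200):
    """Sample a bit sequence; the first 5 bits are uniform (an assumption)."""
    bits = [rng.randint(0, 1) for _ in range(ORDER - 1)]
    while len(bits) < length:
        context = tuple(bits[-(ORDER - 1):])
        bits.append(1 if rng.random() < table[context] else 0)
    return bits

def optimal_predictions(bits):
    """Predict each bit from counts so far: (N1 + 1/2) / (N1 + N0 + 1)."""
    counts = {}  # context -> [N0, N1]
    predictions = []
    for t in range(ORDER - 1, len(bits)):
        context = tuple(bits[t - ORDER + 1:t])
        n0, n1 = counts.setdefault(context, [0, 0])
        predictions.append((n1 + 0.5) / (n1 + n0 + 1.0))
        counts[context][bits[t]] += 1  # update only after predicting
    return predictions

rng = random.Random(0)
bits = sample_bits(rng, random_table(rng))
predictions = optimal_predictions(bits)
```

An unseen context yields a prediction of 0.5, and each further observation pulls the estimate toward the empirical frequency for that context.
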
Figure 13: Dynamic N-Gram Learning Curves. (Plot of cost per sequence in bits, roughly 130 to 160, against sequence number in thousands, 0 to 1000, for LSTM, NTM with LSTM Controller, NTM with Feedforward Controller, and the Optimal Estimator.)


[Figure: Dynamic N-Grams. Panels: Add Vectors, Write Weights, Predictions, Inputs, Read Weights.]




• 5. Priority Sort
- Tests whether the NTM can sort data


- The input is a sequence of 20 random binary vectors, each with a scalar priority rating drawn from [-1, 1]


- The target sequence is the 16 highest-priority vectors, sorted by priority
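
A sketch of how one priority-sort example could be built. The counts (20 inputs, 16 targets) and the priority range come from the slide; the vector width of 8, uniform sampling of priorities, descending output order, and the function name are assumptions.

```python
import random

def make_priority_sort_example(rng, n_inputs=20, n_targets=16, width=8):
    """Return (vectors, priorities, top_indices, targets) for one episode."""
    vectors = [[rng.randint(0, 1) for _ in range(width)] for _ in range(n_inputs)]
    priorities = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    # Rank input positions from highest to lowest priority.
    ranked = sorted(range(n_inputs), key=lambda i: priorities[i], reverse=True)
    top_indices = ranked[:n_targets]
    targets = [vectors[i] for i in top_indices]
    return vectors, priorities, top_indices, targets

vectors, priorities, top_indices, targets = make_priority_sort_example(random.Random(0))
```
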


• 6. Details
- RMSProp algorithm
- Momentum 0.9
- All LSTMs had three stacked hidden layers
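
For orientation, a generic RMSProp update with momentum is sketched below. This is a common textbook form and not necessarily the exact variant used in the experiments. The momentum of 0.9 comes from the slide, the learning rate of 10^-4 matches several table entries, and the decay of 0.95, the epsilon, and the function name are assumptions.

```python
import math

def rmsprop_momentum_step(weight, grad, state, lr=1e-4, momentum=0.9,
                          decay=0.95, eps=1e-8):
    """Apply one update to a scalar weight; `state` holds the running statistics."""
    # Running average of squared gradients.
    state["mean_square"] = decay * state["mean_square"] + (1.0 - decay) * grad * grad
    # Scale the gradient by the root of the running average.
    scaled = grad / math.sqrt(state["mean_square"] + eps)
    # Momentum accumulates the scaled steps.
    state["velocity"] = momentum * state["velocity"] - lr * scaled
    return weight + state["velocity"]

state = {"mean_square": 0.0, "velocity": 0.0}
w1 = rmsprop_momentum_step(1.0, 2.0, state)
v1 = state["velocity"]
w2 = rmsprop_momentum_step(w1, 2.0, state)
v2 = state["velocity"]
```

With a constant positive gradient the weight keeps decreasing and the velocity grows in magnitude, which is the effect momentum is meant to provide.
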
Task           #Heads  Controller Size  Memory Size  Learning Rate  #Parameters
Copy           1       100              128 x 20     10^-4          17,162
Repeat Copy    1       100              128 x 20     10^-4          16,712
Associative    4       256              128 x 20     10^-4          146,845
N-Grams        1       100              128 x 20     3 x 10^-5      14,656
Priority Sort  8       512              128 x 20     3 x 10^-5      508,305


Table 1: NTM with Feedforward Controller Experimental Settings


Task           #Heads  Controller Size  Memory Size  Learning Rate  #Parameters
Copy           1       100              128 x 20     10^-4          67,561
Repeat Copy    1       100              128 x 20     10^-4          66,111
Associative    1       100              128 x 20     10^-4          70,330
N-Grams        1       100              128 x 20     3 x 10^-5      61,749
Priority Sort  5       2 x 100          128 x 20     3 x 10^-5      269,038


Table 2: NTM with LSTM Controller Experimental Settings


Task           Network Size  Learning Rate  #Parameters
Copy           3 x 256       3 x 10^-5      1,352,969
Repeat Copy    3 x 512       3 x 10^-5      5,312,007
Associative    3 x 256       10^-4          1,344,518
N-Grams        3 x 128       10^-4          331,905
Priority Sort  3 x 128       3 x 10^-5      384,424


Table 3: LSTM Network Experimental Settings


Conclusion 


• Introduced a neural network architecture with external memory that is differentiable end-to-end


• Experiments demonstrate that NTMs are capable of learning simple algorithms and of generalizing beyond the training regime


“Again, it [the Analytical Engine] might act upon other things besides numbers... the engine might compose elaborate and scientific pieces of music of any degree of complexity or extent.” (Ada Lovelace)


